Analysis and modelling of Airbnb listings in Vancouver, BC, Canada from April 2021

Table of Contents

  1. Introduction
      Description of the data
  2. Wrangling
      Gather
      Assess
      Clean

  3. Exploratory Data Analysis
      Q1: What neighbourhoods are the listings in?
      Q2: What is the distribution of the prices of the listings?
      Q3 What is the distribution of prices by neighbourhood?
      Q4 What is the relationship between quantitative data variables and the price?
      Q5 What is the association between binary variables and price?
      Q6 What is the association between categorical or ordinal variables and price?
      Q7 What is the association between datetime variables and price?
      Q8 Are bedroom numbers highly correlated to price?
      Q9 Are some Airbnbs mispriced?
      Q10 Are review ratings related to the price of listings?

  4. Linear Modelling
      Drop unuseful features
      Convert datetimes to days since the date the data was scraped
      Removing NAs
      Making dummy variables
      Fitting the model and predicting

  5. Attempting to simplify the linear model

  6. Conclusions and Discussion

  7. Resources

Introduction

I chose a dataset of Airbnb listings from Vancouver, Canada spanning the month of April 2021. If a potential Airbnb host wanted to set a price for their listing, they might want to know what everyone else is setting their prices at, and what features of their listing might justify the price.

The following three broad questions guided the rest of the research questions in this analysis:

I obtained the data from Inside Airbnb.

An associated Medium post is here.

Description of the data

Inside Airbnb provides 5 files (I've renamed the files to avoid redundancy). A data dictionary is here.

Inspecting the data dictionary, I decided to use only listings_summary.csv and selected features from listings.csv, as reviews.csv, calendar.csv, and reviews_summary.csv are pretty well summarized and captured in listings.csv anyway.

Wrangling

Gather

Assess

listings_summary

This is what listings_summary looks like:

neighbourhood_group is all NaN, so will be dropped.

I'm going to drop the name of the listing as well. Although there might be some correlation between the marketing value of a title and the price, I don't plan on doing any text analysis.

host_name also will be dropped, as host_id will suffice for tracking hosts, if necessary.



listings

I'm going to grab only the useful-looking features out of this data.

After a cursory inspection, a number of features related to the host, the properties of the room, and the reviews seem interesting.

Clean

I'll now look through the datasets we have and clean them as necessary, i.e. checking for duplicates, dealing with NaN's, and adjusting datatypes.

For clarity, I'll break the cleaning of listings into three parts related to different groups of features before combining them back together again with listings_summary.

Helper functions for data cleaning

First, I'll define some helper functions for converting data types:

Cleaning listings_summary

For listings_summary, last_review needs to be converted to datetimes.

Cleaning listings_host

listings_host is an all-string dataframe.

I'll need to convert t/f columns to 1.0 and 0.0 representing true and false, convert dates to datetimes, and percentages to ratios.
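A minimal sketch of what such conversions might look like, assuming the raw values are strings like "t"/"f", "95%", and ISO-formatted dates. The function names are hypothetical, and anything unparseable is returned as None:

```python
from datetime import date, datetime
from typing import Optional


def tf_to_float(value: object) -> Optional[float]:
    """Map 't'/'f' strings to 1.0/0.0; anything else (e.g. missing) becomes None."""
    if not isinstance(value, str):
        return None
    return {"t": 1.0, "f": 0.0}.get(value)


def percent_to_ratio(value: object) -> Optional[float]:
    """Convert a string such as '95%' to 0.95; other values become None."""
    if not isinstance(value, str) or not value.endswith("%"):
        return None
    return float(value.rstrip("%")) / 100.0


def to_date(value: object) -> Optional[date]:
    """Parse a 'YYYY-MM-DD' string into a date; other values become None."""
    if not isinstance(value, str):
        return None
    return datetime.strptime(value, "%Y-%m-%d").date()
```

Helpers like these could then be applied column by column, for example with pandas' `Series.map`.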

Cleaning listings_room

listings_room has two columns of note: bathrooms_text holds host-entered strings describing the bathroom situation, whereas amenities holds JSON-formatted collections of the amenities each property has.

Inspecting the amenities, each property has a combination of many amenities. These aren't very standardized.

Pulling out the individual amenities, it's a long-tailed distribution (i.e. most amenity types are present in low-count, but there are a lot of them).

One way to deal with this would be to pull out only the most popular amenities.

An arbitrary threshold of having at least a count of 1000 might be reasonable
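As a rough sketch of the thresholding idea, assuming each amenities entry is a JSON-formatted list of strings. The sample values, the `amenity_` prefix, and the tiny threshold are all made up for illustration:

```python
import json
from collections import Counter

# Toy stand-in for the amenities column: each entry is a JSON-formatted list.
amenities_raw = [
    '["Wifi", "Kitchen", "Heating"]',
    '["Wifi", "Heating"]',
    '["Wifi", "Hot tub"]',
]

# Count how many listings mention each amenity.
counts = Counter()
for raw in amenities_raw:
    counts.update(set(json.loads(raw)))

# Keep only amenities that meet the threshold. The text suggests 1000 for the
# real data; 2 is used here only because the toy sample has three listings.
MIN_COUNT = 2
popular = sorted(name for name, n in counts.items() if n >= MIN_COUNT)

# One 1.0/0.0 indicator per popular amenity, per listing.
indicators = [
    {f"amenity_{name}": float(name in json.loads(raw)) for name in popular}
    for raw in amenities_raw
]
```

Rare amenities such as "Hot tub" in this toy sample fall below the threshold and are simply not encoded.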

I'll prefix the columns in amenities_df, then drop the amenities column in listings_room, before glueing amenities_df back on:

Next, I'll extract numeric values from the bathroom text.

All the values have a numeric value in the front, unless they are half-baths. I'll code half-baths as 0.5 bathrooms. I'll ignore the "shared" or "private" status, as it does not seem to be totally correlated with the type of property.
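The rule above could be sketched with a regular expression. The function name and the example strings are assumptions for illustration, not the exact values in the data:

```python
import re
from typing import Optional


def bathrooms_from_text(text: object) -> Optional[float]:
    """Extract a bathroom count from free text, coding half-baths as 0.5."""
    if not isinstance(text, str):
        return None  # missing value
    # A leading number such as "1" or "1.5" is taken as the count.
    match = re.match(r"\s*(\d+(?:\.\d+)?)", text)
    if match:
        return float(match.group(1))
    # No leading number: half-baths are coded as 0.5, whether shared or private.
    if "half-bath" in text.lower():
        return 0.5
    return None
```

For example, `bathrooms_from_text("1.5 shared baths")` gives 1.5 and `bathrooms_from_text("Shared half-bath")` gives 0.5, with the shared/private wording ignored in both cases.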

bathrooms is all NaN, so I'll just drop it

property_type and room_type seem to encode similar information, except property_type is much more diverse.

Rather than deal with property_type, I'll just drop it

instant_bookable is t/f representing true/false, so I'll convert that to 1/0's

Cleaning listings_review

The only thing I'll change here is converting dates to datetime format

Putting the dataframe back together

Let's put the cleaned data back into one dataframe.

Check for duplicated rows:

There are two duplicated columns:

Drop the duplicated columns:
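One common pandas idiom for this is sketched below. It uses a toy frame in which a hypothetical "price" column appears twice; the real duplicated columns may differ:

```python
import pandas as pd

# Toy frame standing in for the merged data, with "price" appearing twice.
merged = pd.DataFrame(
    [[1, 100.0, 100.0], [2, 85.0, 85.0]],
    columns=["id", "price", "price"],
)

# columns.duplicated() flags the second and later occurrences of a label,
# so negating the mask keeps only the first copy of each column.
deduped = merged.loc[:, ~merged.columns.duplicated()]
```

This keeps the first occurrence of each column label and assumes the duplicated columns hold identical values.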

Rearrange the columns to put some properties from listings_summary next to other grouped properties

Exploratory Data Analysis

Here I'll answer several research questions to guide our exploration of the data.

Before we begin, let's define a helper function to display links to Airbnb listings:

Q1: What neighbourhoods are the listings in?

Let's use GeoPandas to plot locations onto a map of Vancouver.

First, I'll gather the coordinates of the neighbourhoods, and the latitudes/longitudes of the locations

A coordinate reference system is a set of rules that converts locations to a set of points that can be plotted. EPSG:4326 is a common default, but because I want to use contextily to grab a background map for the plot, I'll need to convert to another coordinate system, EPSG:3857, to match the contextily maps.
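To give a feel for what the conversion does, here is a hand-rolled sketch of the spherical Web Mercator formulas behind EPSG:3857. In practice GeoPandas handles this via `GeoDataFrame.to_crs(epsg=3857)`; the function name and the approximate Vancouver coordinates below are only illustrative:

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 semi-major axis, used as the sphere radius


def lon_lat_to_web_mercator(lon_deg: float, lat_deg: float) -> tuple:
    """Project EPSG:4326 longitude/latitude (degrees) to EPSG:3857 metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4 + math.radians(lat_deg) / 2)
    )
    return x, y


# Roughly downtown Vancouver (approximate coordinates, for illustration only).
x, y = lon_lat_to_web_mercator(-123.12, 49.28)
```

Longitudes west of Greenwich map to negative x values and northern latitudes to positive y values, both in metres rather than degrees.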

Looks like Downtown is a hotspot of listings, let's confirm:

Q2: What is the distribution of the prices of the listings?

Prices in this dataset are whatever was scraped off the Airbnb website. They do not account for complexities such as surge pricing or fine-grained host adjustments by date, nor additional fees and taxes.

Here is the overall distribution of prices. The prices are right-skewed, with a smattering of very expensive listings, while the median is 115 CAD per night.

Q3 What is the distribution of prices by neighbourhood?

Q4 What is the relationship between quantitative data variables and the price?

Here, I'll make a correlation matrix of quantitative variables to price.

Many of the numerical columns are actually T/F data encoded as 1/0's. Some are also ordinal. I'll pick out the true quantitative variables, then plot their correlation against the other quantitative variables. I'll plot the correlation coefficients in the top half of the grid.

There are a few patches of higher correlation.

Q5 What is the association between binary variables and price?

Let's take a look at the binary variables (i.e. those that are true/false for a listing).

It's hard to tell too much from these boxplots alone. Some of the variables are very unbalanced between groups. As well, the outliers could be dragging the means by quite a bit (plots are log scale on price).

To help get a sense of how different the two groups are for each variable, I'll use t-tests and Mood's test (for the mean and the median respectively) on these variables.

These tests are meant here to be interpreted as a ranking statistic, rather than a strict test of significance.
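As an illustration of using a test statistic purely for ranking, the sketch below computes Welch's t statistic by hand and sorts features by its absolute value. The Welch variant, the feature names, and the prices are all assumptions for the example:

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for the difference in means of two samples."""
    standard_error = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


# Hypothetical prices split by whether a binary feature is present or absent.
prices_by_feature = {
    "amenity_a": ([150.0, 180.0, 210.0, 240.0], [90.0, 100.0, 110.0, 120.0]),
    "amenity_b": ([100.0, 120.0, 140.0], [95.0, 115.0, 150.0]),
}

# Rank features by how strongly the group means differ, largest |t| first.
ranking = sorted(
    prices_by_feature,
    key=lambda name: abs(welch_t(*prices_by_feature[name])),
    reverse=True,
)
```

For real use, `scipy.stats.ttest_ind` and `scipy.stats.median_test` provide the t-test and Mood's median test along with p-values.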

Looking back at the boxplots, most of these amenities do seem to be associated with increased price when present, although some like luggage dropoff aren't.

Q6 What is the association between categorical or ordinal variables and price?

There are actually only two categorical variables, each with 4 options. host_response_time is ordinal; room_type may or may not be ordinal.

There does not seem to be a strong association between any host_response_time category and price. However, the entire home/apt category has the highest prices. This is similar to how variables related to the size of the listing are related to price.

Q7 What is the association between datetime variables and price?

There doesn't appear to be much association between the price of a listing and how long the host has been on Airbnb, how long ago the listing was first reviewed, or how long ago it was last reviewed.

Q8 Are bedroom numbers highly correlated to price?

In Q4, we saw that the size of a listing is moderately correlated with its price. Let's take a closer look at this relationship by plotting bedrooms and beds vs. price.

The listings generally follow the trend of more bedrooms/beds having a higher price. However there are some outliers and anomalies.

There is a listing that has 13 bedrooms/beds. Going to the listing (see the link), it appears to be a joke listing for a haunted house, with no availability in the upcoming year.

There's a 12 bed but two bedroom listing that is "dorm" style:

Some listings have 0 beds but non-zero bedrooms. These seem to be oversights on the part of the hosts, and indeed some listings have already amended their number of beds (e.g. listing 322824).

There are a number of extremely expensive one bedroom listings. These are explored below.

Q9 Are some Airbnbs mispriced?

Let's take a closer look at one bedroom prices.

Some of the outliers have very high prices:

Listings 37181228 and 44306270 look like they have a monthly rental price that has been entered as a nightly price. Notably, both have a 6 month minimum stay.

I believe the other 3 listings are similarly mispriced, based on my knowledge of Vancouver monthly rental prices. Notably, listing 5089343 says it is a 6 month sublet even though that's not reflected in its minimum night limit.

Q10 Are review ratings related to the price of listings?

First, let's take a look at the number of reviews a listing has versus its rating.

There is a notable absence of lower ratings as the number of reviews increases. This may be because listings with consistently low ratings get removed from Airbnb.

Plotting price against the review ratings shows there isn't much relationship between prices and ratings, although some of the higher priced outliers do have better ratings. Note the absence of some of the most expensive properties that don't have reviews.

Removing very low ratings doesn't improve the correlation much.

Linear modelling

Let's see if we can fit a simple linear model to predict the price based on features of the listing.

Drop unuseful features

First, let's drop some features out of the model that probably shouldn't be used to predict.

Convert datetimes to days since the date the data was scraped

Let's convert datetimes to a quantitative measure of how many days it's been since that date.
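A minimal sketch of this conversion with the standard library. The scrape date below is a placeholder assumption and the function name is hypothetical:

```python
from datetime import date
from typing import Optional

# Hypothetical reference date; substitute the date the data was actually scraped.
SCRAPE_DATE = date(2021, 4, 12)


def days_since(event: Optional[date], reference: date = SCRAPE_DATE) -> Optional[int]:
    """Number of days between an event date and the reference (scrape) date."""
    if event is None:
        return None  # keep missing dates missing so they can be handled later
    return (reference - event).days
```

With this placeholder reference date, a host who joined on 2020-04-12 would get a value of 365.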

Removing NAs

Now, let's see how many NA's we have in this data set.

Looking at the webpages for some of the NAs for features such as host_since, it seems like these NAs might be from scraping problems, as the hosts seem to have the information on the Airbnb website.

Based on this and my EDA results for Q4, Q5, and Q7, I'm gonna try to save some trouble and drop a lot of the host-related columns.

How shall we fill the NA's?

For review scores, the median or mean makes sense.

For first_review or last_review, and reviews_per_month, NA's basically mean there's no reviews, so 0 makes sense.

For beds, bedrooms, and bathrooms_text, I could potentially try to figure out if I could infer some number for a missing feature based on the other two features. I took a look at the value counts for listings missing one of these features though, and decided it'd probably just be easier to fill the NAs with the median, which is 1 for all three features.
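The three filling strategies above might look like this in pandas. The toy values are made up, and review_scores_rating is assumed as the name of a review score column:

```python
import pandas as pd

# Toy data with one missing value per column.
listings = pd.DataFrame({
    "review_scores_rating": [4.8, None, 4.2, 5.0],
    "reviews_per_month": [1.5, None, 0.3, 2.0],
    "bedrooms": [1.0, 1.0, None, 2.0],
})

# Review scores: fill with the column median.
listings["review_scores_rating"] = listings["review_scores_rating"].fillna(
    listings["review_scores_rating"].median()
)

# No reviews yet: a missing rate is treated as 0.
listings["reviews_per_month"] = listings["reviews_per_month"].fillna(0)

# Size features: fill with the column median (1 in this toy sample).
listings["bedrooms"] = listings["bedrooms"].fillna(listings["bedrooms"].median())
```

The same pattern would extend to beds and to the numeric values extracted from bathrooms_text.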

Making dummy variables

For categorical features, we'll have to make dummy variables.
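A short sketch with `pandas.get_dummies` on a toy frame. Dropping the first level (`drop_first=True`) is a choice made here to avoid perfectly collinear columns in a linear model with an intercept, not necessarily what was done in this analysis:

```python
import pandas as pd

listings = pd.DataFrame({
    "bedrooms": [1, 2, 1],
    "room_type": ["Entire home/apt", "Private room", "Hotel room"],
})

# One indicator column per category level, minus the first (alphabetical) level,
# which becomes the implicit baseline.
encoded = pd.get_dummies(
    listings, columns=["room_type"], drop_first=True, dtype=float
)
```

Here "Entire home/apt" becomes the baseline, so the remaining coefficients would be read as differences relative to it.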

Fitting the model and predicting

OK, now that we have that data ready, let's:

  1. Split into X and y (independent and dependent variables)
  2. Split the data into a training and a test set
  3. Fit a linear model
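The steps above can be sketched on synthetic data using plain NumPy least squares. This is a dependency-light illustration rather than the exact code used here; scikit-learn's `train_test_split` and `LinearRegression` offer equivalents. The 30% test fraction and the synthetic coefficients are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Synthetic X and y: price depends linearly on two features plus noise.
n = 200
X = rng.uniform(0, 5, size=(n, 2))
y = 40.0 + 50.0 * X[:, 0] - 10.0 * X[:, 1] + rng.normal(0, 5, size=n)

# 2. Shuffle the rows, then hold out 30% of them as a test set.
order = rng.permutation(n)
n_test = int(0.3 * n)
test_idx, train_idx = order[:n_test], order[n_test:]


def add_intercept(features):
    """Prepend a column of ones so the fit includes an intercept term."""
    return np.column_stack([np.ones(len(features)), features])


# 3. Fit ordinary least squares on the training rows only.
coef, *_ = np.linalg.lstsq(add_intercept(X[train_idx]), y[train_idx], rcond=None)

# Score on the held-out rows with R^2 = 1 - SS_res / SS_tot.
pred = add_intercept(X[test_idx]) @ coef
ss_res = np.sum((y[test_idx] - pred) ** 2)
ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
```

Scoring on rows the model never saw is what makes the r squared value a measure of predictive performance rather than of fit.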

The r squared score suggests it's moderately predictive. Let's take a look at the predictions.

Looks like the model is underpredicting higher priced listings, probably because there are fewer of them and maybe because the prices of higher priced listings increase non-linearly (e.g. a long tail following a power law).

Also, the model is underpredicting some of the lower priced listings, into the negatives!

Let's take a look at the model coefficients:

The biggest positive predictor of price is the number of bedrooms, whereas the biggest negative predictor of price is whether the room type is a hotel room.

Attempting to simplify the linear model

I'm curious as to how good the linear model is if you use a more minimal set of features.

As a guess, I'm going to try just with bedrooms, room_type, and neighbourhood.

The model really does look worse than using all the features (esp. comparing the r2 score of 0.32 vs 0.41).

Conclusions and Discussion

To summarize some key findings of the exploratory data analysis for Airbnb listings for the month of April in Vancouver:

I was able to fit a linear model that was moderately predictive of listing pricing (r2 of 0.41). However, this model needs to be tested on more data, such as listings outside the month of April 2021. Indeed, it would be desirable to fit (probably more complex) models on a larger data set from a longer time period, and use them to predict prices going forward.

Since the data is taken during the COVID-19 pandemic and associated travel restrictions, it is unclear if this model is generalizable to listings during non-pandemic times.

Resources